Deep Learning Tutorial 
Hung-yi Lee 





Deep learning 
attracts lots of attention. 


* I believe you have seen lots of exciting results
before.










Deep learning trends 
at Google. Source: 
SIGMOD/Jeff Dean 
This talk focuses on the basic techniques.


Outline 


Lecture I: Introduction of Deep Learning

Lecture II: Tips for Training DNN

Lecture III: Variants of Neural Network

Lecture IV: Next Wave
Lecture I:
Introduction of 
Deep Learning 





Outline of Lecture I


Let's start with general 


machine learning. 


“Hello World” for Deep Learning

Machine Learning
= Looking for a Function 


* Speech Recognition

f( [audio signal] ) = "How are you"

* Image Recognition

f( [image] ) = "Cat"

* Playing Go

f( [board position] ) = (next move)

* Dialogue System

f( "Hi" ) = "Hello"
(what the user said) (system response)


Image Recognition: 


Framework 
[Figure: candidate functions map each input image to a label such as "monkey", "cat", or "dog"]





Three Steps for Deep Learning 





Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function





Deep Learning is so simple ...... 
Three Steps for Deep Learning

Step 1: Neural Network
Step 2: goodness of function
Step 3: pick the best function





Deep Learning is so simple ...... 
Human Brains

[Figure: a biological neuron, labelled: stimulus, dendrites, presynaptic cell, nucleus, axon, synaptic terminals, neurotransmitter, postsynaptic cell]





Neural Network 


Neuron 


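$z = a_1 w_1 + \cdots + a_k w_k + \cdots + a_K w_K + b$

A simple function

[Figure: a neuron with inputs $a_1, \ldots, a_K$, weights $w_1, \ldots, w_K$, bias $b$, and an activation function $\sigma$ applied to $z$]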
Neural Network

Neuron with the sigmoid function as activation function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

[Figure: plot of $\sigma(z)$; the neuron applies it to the weighted sum of its inputs plus the bias]
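The neuron described above can be sketched in a few lines. This is only an illustration: the function names `sigmoid` and `neuron` and the input, weight, and bias values are assumptions chosen for the example, not taken from the slides.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Illustrative values (assumed): z = 1*1 + (-1)*(-2) + 1 = 4
output = neuron(inputs=[1.0, -1.0], weights=[1.0, -2.0], bias=1.0)
print(output)
```

Changing the weights or the bias changes the function the neuron computes, which is why they are called the parameters.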
Neural Network

Different connections lead to
different network structures.

Each neuron can have different values
of weights and biases.

Weights and biases are the network parameters $\theta$.





Fully Connect Feedforward 
Network 
[Figure: a fully connected feedforward network mapping an input vector to an output vector]

Given parameters $\theta$, the network defines a function.

Given a network structure, it defines a function set.


neuron 


Deep means many hidden layers 





Output Layer (Option)

* Softmax layer as the output layer

Ordinary Layer

$y_1 = \sigma(z_1)$, $y_2 = \sigma(z_2)$, $y_3 = \sigma(z_3)$

In general, the output of the
network can be any value.
It may not be easy to interpret.

Softmax Layer

$y_i = e^{z_i} / \sum_j e^{z_j}$

Probability: $1 > y_i > 0$ and $\sum_i y_i = 1$
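A softmax layer can be sketched as follows. The function name `softmax` and the input values are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(z):
    """Turn arbitrary real values into positive numbers that sum to 1."""
    shift = max(z)  # for numerical stability only
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative pre-activations (assumed values)
y = softmax([3.0, 1.0, -3.0])
print([round(v, 2) for v in y])
```

Every output lies strictly between 0 and 1 and the outputs sum to 1, so they can be read as probabilities.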
Example Application

Input: a 16 x 16 = 256 pixel image (ink → 1, no ink → 0).
Output: each dimension represents the confidence of a digit.
Example Application 


* Handwriting Digit Recognition 


Neural 


Network 





What is needed is a
function ......
Input: 256-dim vector
Output: 10-dim vector
Example Application

[Figure: network with Input, Layer 1, Layer 2, ..., Layer L, Output]

A function set containing the
candidates for handwriting digit recognition.

You need to decide the network structure to
let a good function be in your function set.


FAQ

[Figure: Input Layer, Hidden Layers, Output Layer]

* Q: How many layers? How many neurons for each
layer?

* Q: Can the structure be automatically determined?
Three Steps for Deep Learning

Step 1: Neural Network
Step 2: goodness of function
Step 3: pick the best function





Deep Learning is so simple ...... 
Training Data


* Preparing training data: images and their labels 


The learning target is defined on 
the training data. 
Learning Target

[Figure: a 16 x 16 = 256 pixel input image (ink → 1, no ink → 0) fed to the network, which ends in a softmax layer]

The learning target is ......

Input: [image of "1"] → $y_1$ has the maximum value
Input: [image of "2"] → $y_2$ has the maximum value








Loss

A good function should make the loss
of all examples as small as possible.

Loss can be the distance between the
network output and the target.

Total Loss: the sum of the losses over all training data, to be made as small as possible.

Find a function in the
function set that
minimizes total loss L

Find the network
parameters $\theta^*$ that
minimize total loss L
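As a small sketch of the idea, the total loss can be computed by summing a per-example distance between network output and target. The squared distance used here, the function names, and the numbers are assumptions for illustration. The slides only say the loss "can be the distance".

```python
def example_loss(output, target):
    """Squared distance between one network output and its target (assumed choice)."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def total_loss(outputs, targets):
    """Sum of the per-example losses over all training data."""
    return sum(example_loss(o, t) for o, t in zip(outputs, targets))


# Two illustrative training examples with 2-dimensional outputs
outputs = [[0.8, 0.2], [0.4, 0.6]]
targets = [[1.0, 0.0], [0.0, 1.0]]
L = total_loss(outputs, targets)
print(L)
```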
Three Steps for Deep Learning

Step 1: Neural Network
Step 2: goodness of function
Step 3: pick the best function





Deep Learning is so simple ...... 
How to pick the best function

Find network parameters $\theta^*$ that minimize total loss L

Enumerate all possible values?

Network parameters $\theta = \{w_1, w_2, w_3, \ldots, b_1, b_2, b_3, \ldots\}$

Millions of parameters

E.g. speech recognition: 8 layers and
1000 neurons each layer, so there are 1000 x 1000 = $10^6$ weights between layer l and layer l+1.
Gradient Descent

Network parameters $\theta = \{w_1, w_2, \ldots, b_1, b_2, \ldots\}$

Find network parameters $\theta^*$ that minimize total loss L

> Pick an initial value for w
  Random (usually good enough) or RBM pre-train

> Compute $\partial L / \partial w$
  Negative → increase w
  Positive → decrease w

> Update $w \leftarrow w - \eta \, \partial L / \partial w$
  $\eta$ is called the “learning rate”

> Repeat until $\partial L / \partial w$ is approximately zero
  (when the update is little)

[Figure: total loss L plotted against w. Image source: http://chico386.pixnet.net/album/photo/171572850]
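The loop above (pick an initial value, compute the derivative, update, repeat until the derivative is small) can be sketched for a single parameter. The toy loss $L(w) = (w - 3)^2$, the function name `gradient_descent`, and the stopping tolerance are assumptions for illustration.

```python
def gradient_descent(dL_dw, w, eta=0.1, tol=1e-6, max_steps=10_000):
    """Repeat w <- w - eta * dL/dw until the derivative is approximately zero."""
    for _ in range(max_steps):
        g = dL_dw(w)
        if abs(g) < tol:  # the update would be tiny, so stop
            break
        w = w - eta * g   # negative derivative increases w, positive decreases it
    return w


# Toy loss L(w) = (w - 3)^2 with derivative 2 * (w - 3); the minimum is at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_star)
```

With a real network the same update is applied to every weight and bias, using the derivatives of the total loss.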
Gradient Descent

Starting from the initial $\theta$, repeat for every parameter (on this slide the learning rate is written $\mu$):

Compute $\partial L / \partial w_1$, then update $w_1$ by $-\mu \, \partial L / \partial w_1$
Compute $\partial L / \partial w_2$, then update $w_2$ by $-\mu \, \partial L / \partial w_2$
......
Compute $\partial L / \partial b_1$, then update $b_1$ by $-\mu \, \partial L / \partial b_1$
......

Gradient: $\nabla L = \left[ \partial L / \partial w_1, \; \partial L / \partial w_2, \; \ldots, \; \partial L / \partial b_1, \; \ldots \right]^T$
Gradient Descent

[Figure: contour plot over $w_1$ and $w_2$. Color: value of total loss L]
Gradient Descent - Difficulty

* Gradient descent never guarantees global minima

Different initial points
reach different minima,
so different results.

There are some tips to
help you avoid local
minima, but no guarantee.

You are playing Age of Empires ...
You cannot see the whole map.

Movement: $(-\eta \, \partial L / \partial w_1, \; -\eta \, \partial L / \partial w_2)$
Gradient Descent

This is the “learning” of machines in deep
learning ......

Even AlphaGo uses this approach.

Actually .....

I hope you are not too disappointed :p





Backpropagation

* Backpropagation: an efficient way to compute $\partial L / \partial w$

* Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html

Toolkits: libdnn, Theano, TensorFlow, Caffe, CNTK (Microsoft)

Don't worry about $\partial L / \partial w$, the toolkits will handle it.





Concluding Remarks 


Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function





Deep Learning is so simple ...... 
Outline of Lecture I

“Hello World” for Deep Learning
Deeper is Better?

[Table: word error rate (%) for different Layer X Size configurations]

Not surprising: more
parameters, better
performance.





Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription 
Using Context-Dependent Deep Neural Networks." Interspeech. 2011. 


Universality Theorem

Any continuous function f
can be realized by a network
with one hidden layer
(given enough hidden
neurons).

Reference for the reason:
http://neuralnetworksanddeeplearning.com/chap4.html

Why "Deep" neural network not "Fat" neural network?
Fat + Short v.s. Thin + Tall

The same number
of parameters

[Figure: a shallow (fat + short) network compared with a deep (thin + tall) network]

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.


Analogy

Logic circuits

* Logic circuits consist of
gates
* Two layers of logic gates
can represent any Boolean
function.
* Using multiple layers of
logic gates to build some
functions is much simpler

Neural network

* A neural network consists of
neurons
* A network with one hidden layer can
represent any continuous
function.
* Using multiple layers of
neurons to represent some
functions is much simpler








This page is for EE background. 


Modularization

* Deep → Modularization

Image → Classifier 1: Girls with long hair
Image → Classifier 2: Boys with long hair (weak)
Image → Classifier 3: Girls with short hair
Image → Classifier 4: Boys with short hair
Modularization

* Deep → Modularization

Each basic classifier can have
sufficient training examples.

Image → Basic Classifier: Boy or Girl?

Classifiers for the
attributes
Modularization

* Deep → Modularization

Image → Basic Classifiers (Boy or Girl? Long or short?), shared by the
following classifiers as a module:

Classifier 1: Girls with long hair
Classifier 2: Boys with long hair (little data)
Classifier 3: Girls with short hair
Classifier 4: Boys with short hair

The following classifiers can be trained with little data.





Modularization

* Deep → Modularization

The modularization is
automatically learned from data.

The most basic classifiers → use 1st layer as module to build classifiers → use 2nd layer as module ......

Reference: Zeiler, M. D., & Fergus, R.
(2014). Visualizing and understanding
convolutional networks. In Computer
Vision-ECCV 2014 (pp. 818-833)
Outline of Lecture I

“Hello World” for Deep Learning
If you want to learn Theano:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html

TensorFlow or Theano: very flexible, but need some effort to learn

Keras: interface of TensorFlow or Theano. Easy to learn and use
(still has some flexibility).

You can modify it if you can write
TensorFlow or Theano
Keras

* François Chollet is the author of Keras.
* He currently works for Google as a deep learning
engineer and researcher.
* Keras means horn in Greek
* Documentation: http://keras.io/
* Example:
https://github.com/fchollet/keras/tree/master/examples
Example Application

* Handwriting Digit Recognition

MNIST Data: http://yann.lecun.com/exdb/mnist/

"Hello world" for deep learning

[Figure: a sample digit image shown as a grid of pixel values]

Keras provides data sets loading function: http://keras.io/datasets/





Keras

Step 1: define a set of function

[Figure: network with a 28 x 28 input, hidden layers of 500 neurons, and a softmax output layer]

Step 2: goodness of function

Step 3: pick the best function

Step 3.1: Configuration
$w \leftarrow w - \eta \, \partial L / \partial w$, with learning rate 0.1

Step 3.2: Find the optimal network parameters
Training data (images), labels (digits)
(Next lecture)

Training data: numpy array with (number of training examples) rows of 28 x 28 = 784 values
Labels: numpy array with (number of training examples) rows of 10 values

https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html


Keras

Save and load models:
http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

How to use the neural network (testing):
Keras

* Using GPU to speed training
* Way 1
  * THEANO_FLAGS=device=gpu0 python YourCode.py
* Way 2 (in your code)
  * import os
  * os.environ["THEANO_FLAGS"] = "device=gpu0"
Live Demo

Lecture II:
Tips for Training DNN







Recipe of Deep Learning

Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function

Good results on training data?
  NO: go back to the three steps
  YES: good results on testing data?
    NO: overfitting! Go back to the three steps
    YES: done











Do not always blame Overfitting

[Figure: error vs. iterations (1e4) for a 20-layer and a 56-layer network. Panel "Testing Data" is annotated "Overfitting?" and panel "Training Data" is annotated "Not well trained"]

Deep Residual Learning for Image
Recognition
http://arxiv.org/abs/1512.03385








Recipe of Deep Learning

Different approaches for
different problems.

e.g. dropout for good results
on testing data

For good results on training data:
* Choosing proper loss
* Mini-batch
* New activation function
* Adaptive learning rate
* Momentum
Choosing Proper Loss

[Figure: softmax output of the network compared with the target for an input digit image]

Square Error or Cross Entropy: which one is better?

Let's try it

Accuracy:
Square Error 0.11
Cross Entropy 0.84

[Figure: training accuracy vs. epoch (1 to 20) for square error and cross entropy]
Choosing Proper Loss

When using softmax output layer,
choose cross entropy.

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
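To make the comparison concrete, here is a small sketch of both losses on a softmax-style output, using their standard definitions. The function names and the example vectors are assumptions for illustration. For a probability vector and a one-hot target, the square error can never exceed 2, while the cross entropy keeps growing as the probability of the correct class approaches 0. This is one commonly given reason for preferring cross entropy with softmax, offered here only as intuition.

```python
import math


def square_error(y, target):
    """Sum of squared differences between output and target."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def cross_entropy(y, target):
    """-sum(target_i * ln(y_i)); terms with target 0 contribute nothing."""
    return -sum(ti * math.log(yi) for yi, ti in zip(y, target) if ti > 0)


target = [1.0, 0.0, 0.0]          # one-hot target (assumed example)
good = [0.9, 0.05, 0.05]          # output close to the target
bad = [0.01, 0.495, 0.495]        # output far from the target

for name, y in (("good", good), ("bad", bad)):
    print(name, square_error(y, target), cross_entropy(y, target))
```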
Mini-batch

model.fit(x_train, y_train, ...)

100 examples in a mini-batch
Repeat 20 times

We do not really minimize total loss!

> Randomly initialize network parameters
> Pick the 1st batch: L' is the sum of the losses of the examples in that batch
  Update parameters once
> Pick the 2nd batch: L'' is the sum of the losses of the examples in that batch
  Update parameters once
  ......
> Until all mini-batches have been picked: one epoch

Repeat the above process

L is different each time
when we update
parameters!


Mini-batch is Faster

Original Gradient Descent: update after seeing all examples
With Mini-batch: if there are 20 batches, update 20 times in one epoch

[Figure: loss contours. Original gradient descent sees all examples before each update; mini-batch sees only one batch before each update]

Can have the same speed
(not super large data set)





Mini-batch is Better!

Testing accuracy:
Mini-batch 0.84
No batch 0.12

[Figure: training accuracy vs. epoch (1 to 20)]


Mini-batch

Shuffle the training examples for each epoch

[Figure: the examples are grouped into different mini-batches in Epoch 1 and Epoch 2]
Don't worry. This is the default of Keras. 
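The mini-batch procedure (shuffle each epoch, update once per batch) can be sketched on a toy problem. Everything here is an assumption for illustration: the one-parameter model $y = w x$, the squared loss, the averaging of the gradient over the batch, and the names `minibatch_sgd`, `xs`, `ys`. It is not the Keras implementation.

```python
import random


def minibatch_sgd(xs, ys, batch_size, epochs, eta=0.01, seed=0):
    """Fit y = w * x with one parameter update per mini-batch."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # shuffle the training examples for each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch loss sum((w*x - y)^2), averaged over the batch
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= eta * grad
            updates += 1
    return w, updates


xs = [i / 10.0 for i in range(1, 101)]  # 100 toy inputs
ys = [2.0 * x for x in xs]              # generated with the true w = 2
w, updates = minibatch_sgd(xs, ys, batch_size=10, epochs=20)
print(w, updates)
```

With 100 examples and a batch size of 10 there are 10 updates per epoch, so 20 epochs give 200 updates, compared with 20 updates if every update had to see all examples.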


Hard to get the power of Deep ...

Handwriting Digit Classification

Results on Training Data

[Figure: accuracy vs. number of layers (1 to 9)]

Deeper usually does not imply better.

Let's try it

Accuracy:
3 layers 0.84
9 layers 0.11

[Figure: training accuracy vs. epoch (1 to 20) for 3 layers and 9 layers]
Vanishing Gradient Problem

[Figure: a deep network in which some layers receive smaller gradients than others]

Smaller gradients

based on random!?
Hard to get the power of Deep ...

Handwriting Digit Classification

In 2006, people used RBM pre-training.
In 2015, people use ReLU.

[Figure: accuracy vs. number of layers]





ReLU

* Rectified Linear Unit (ReLU)

$a = \sigma(z)$ with $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$

Reason:
1. Fast to compute
2. Biological reason
3. Infinite sigmoid
with different biases
4. Vanishing gradient
problem

[Xavier Glorot, AISTATS’11]
[Andrew L. Maas, ICML’13]
[Kaiming He, arXiv’15]
ReLU

[Figure: neurons whose output is 0 can be removed from the network, leaving a thinner linear network]

A thinner linear network

Do not have
smaller gradients
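A rough numerical illustration of why ReLU does not shrink gradients: backpropagation multiplies by the activation derivative at every layer. The sigmoid derivative is at most 0.25, while the ReLU derivative is exactly 1 for active neurons. The sketch below ignores the weights and fixes the pre-activation at 1 in every layer, both simplifying assumptions made only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for active neurons, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


depth = 9  # number of layers the gradient passes through
sigmoid_chain = sigmoid_grad(1.0) ** depth
relu_chain = relu_grad(1.0) ** depth
print(sigmoid_chain, relu_chain)
```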
Let's try it

* 9 layers

Accuracy:
Sigmoid 0.11
ReLU 0.96

[Figure: training accuracy vs. epoch (1 to 20) for sigmoid and ReLU]
ReLU - variant

Leaky ReLU: $a = z$ for $z > 0$, and a small fixed slope times $z$ otherwise
Parametric ReLU: $a = z$ for $z > 0$, and $a = \alpha z$ otherwise; $\alpha$ is also learned by
gradient descent

Maxout

ReLU is a special case of Maxout

* Learnable activation function [Ian J. Goodfellow, ICML’13]
* You can have more than 2 elements in a group.
* The activation function in a maxout network can be
any piecewise linear convex function
* How many pieces depends on how many
elements are in a group

[Figure: activation shapes with 2 elements in a group and with 3 elements in a group]
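A maxout layer can be sketched as taking the maximum over groups of pre-activations. The function names, the consecutive grouping scheme, and the numbers are assumptions for illustration. The second function shows the sense in which ReLU is a special case: a group of two elements where one element is fixed at 0.

```python
def maxout(zs, group_size):
    """Split the pre-activations into consecutive groups and keep each group's max."""
    return [max(zs[i:i + group_size]) for i in range(0, len(zs), group_size)]


def relu_via_maxout(z):
    """ReLU as maxout over the pair (z, 0)."""
    return maxout([z, 0.0], group_size=2)[0]


# Illustrative pre-activations, two elements per group
a = maxout([5.0, 7.0, -1.0, 1.0], group_size=2)
print(a)
```

In a real maxout network each element of a group has its own learned weights, which is what makes the activation function learnable.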
Learning Rates

Set the learning
rate $\eta$ carefully

[Figure: total loss surface over two parameters]

If learning rate is too large

Total loss may not decrease
after each update


Learning Rates

* Popular & Simple Idea: Reduce the learning rate by
some factor every few epochs.
  * At the beginning, we are far from the destination, so we
  use larger learning rate
  * After several epochs, we are close to the destination, so
  we reduce the learning rate
  * E.g. 1/t decay: $\eta^t = \eta / \sqrt{t + 1}$
* Learning rate cannot be one-size-fits-all
  * Giving different parameters different learning
  rates
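As a tiny sketch of a decaying schedule, the learning rate at step t can be computed as $\eta / \sqrt{t + 1}$. The function name `decayed_rate` and the base rate of 0.1 are assumptions for illustration.

```python
import math


def decayed_rate(eta, t):
    """Learning rate at step t: eta / sqrt(t + 1)."""
    return eta / math.sqrt(t + 1)


rates = [decayed_rate(0.1, t) for t in range(4)]
print(rates)
```

The rate starts at the base value for t = 0 and has halved by t = 3.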
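Adagrad

Original: $w \leftarrow w - \eta \, \partial L / \partial w$

Adagrad: $w \leftarrow w - \eta_w \, \partial L / \partial w$

Parameter dependent
learning rate: $\eta_w = \dfrac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}$

$\eta$ is a constant and $g^i$ is the $\partial L / \partial w$ obtained at the i-th update.

Summation of the square of the previous derivatives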





Adagrad

$w_1$: derivatives 0.1, 0.2, ......
Learning rate: $\eta / \sqrt{0.1^2} = \eta / 0.1$, then $\eta / \sqrt{0.1^2 + 0.2^2} = \eta / 0.22$

$w_2$: derivatives 20.0, 10.0, ......
Learning rate: $\eta / \sqrt{20^2} = \eta / 20$, then $\eta / \sqrt{20^2 + 10^2} = \eta / 22$

Observation: 1. Learning rate is smaller and
smaller for all parameters
2. Smaller derivatives, larger
learning rate, and vice versa

Why?

Larger derivatives → smaller learning rate
Smaller derivatives → larger learning rate
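The parameter-dependent learning rate can be checked numerically. The sketch below divides a constant $\eta$ by the square root of the running sum of squared derivatives. The function name `adagrad_rates` and the choice $\eta = 1$ are assumptions for illustration; the derivative sequences are the ones from the worked example (0.1, 0.2 and 20.0, 10.0).

```python
import math


def adagrad_rates(eta, gradients):
    """Effective learning rate after each derivative: eta / sqrt(sum of squares so far)."""
    rates = []
    sum_sq = 0.0
    for g in gradients:
        sum_sq += g * g
        rates.append(eta / math.sqrt(sum_sq))
    return rates


r1 = adagrad_rates(1.0, [0.1, 0.2])    # parameter with small derivatives
r2 = adagrad_rates(1.0, [20.0, 10.0])  # parameter with large derivatives
print(r1, r2)
```

Both sequences decrease over time, and the parameter with the smaller derivatives always gets the larger learning rate.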





Not the whole story ...... 





* Adagrad [John Duchi, JMLR’11]
* RMSprop
  * https://www.youtube.com/watch?v=O3sxAc4hxZU
* Adadelta [Matthew D. Zeiler, arXiv’12]
* "No more pesky learning rates" [Tom Schaul, arXiv’12]
* AdaSecant [Caglar Gulcehre, arXiv’14]
* Adam [Diederik P. Kingma, ICLR’15]
* Nadam
  * http://cs229.stanford.edu/proj2015/054_report.pdf
Hard to find
optimal network parameters

[Figure: total loss vs. the value of a network parameter w]

Very slow at the plateau: $\partial L / \partial w \approx 0$
Stuck at saddle point: $\partial L / \partial w = 0$
Stuck at local minima: $\partial L / \partial w = 0$

In physical world ......

* Momentum

How about putting this phenomenon
in gradient descent?
Momentum

Still no guarantee of reaching
global minima, but gives some
hope ......

Movement =
Negative of $\partial L / \partial w$ + Momentum

[Figure: cost vs. w, with arrows for the negative of $\partial L / \partial w$, the momentum, and the real movement. At a point where $\partial L / \partial w = 0$ the momentum still moves the parameter]
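A single momentum update can be sketched as follows. The function name `momentum_step`, the momentum coefficient of 0.9, and the gradient values are assumptions for illustration. The real movement is the previous movement scaled by the momentum coefficient plus the usual negative-gradient step.

```python
def momentum_step(w, v, grad, eta=0.1, momentum=0.9):
    """Movement = momentum * previous movement - eta * dL/dw."""
    v_new = momentum * v - eta * grad
    return w + v_new, v_new


w, v = 0.0, 0.0
w, v = momentum_step(w, v, grad=-1.0)  # ordinary step: moves by 0.1
w, v = momentum_step(w, v, grad=0.0)   # derivative is zero, yet w still moves
print(w, v)
```

The second step is the interesting one: plain gradient descent would stop when the derivative is zero, while the momentum term keeps the parameter moving.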
Adam

model.compile(loss=...,
              optimizer=...,
              metrics=...)
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details,
and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise
square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$,
$\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$
we denote $\beta_1$ and $\beta_2$ to the power $t$.
Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
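The Adam update can be sketched for a single scalar parameter. This follows the published update equations with the default settings listed in the algorithm, but the function name `adam`, the fixed number of steps in place of a convergence test, and the toy objective $f(\theta) = \theta^2$ are assumptions for illustration.

```python
import math


def adam(grad, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run a fixed number of Adam updates on one scalar parameter."""
    m = 0.0  # first moment estimate
    v = 0.0  # second raw moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def toy_grad(theta):
    """Derivative of the toy objective f(theta) = theta^2."""
    return 2.0 * theta


one_step = adam(toy_grad, theta=1.0, steps=1)
theta_final = adam(toy_grad, theta=1.0, alpha=0.1, steps=500)
print(one_step, theta_final)
```

After bias correction the very first update has size close to $\alpha$ regardless of the gradient's scale, which is part of why the step size is easy to interpret.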








Let's try it

* ReLU, 3 layer

Testing accuracy:
Original 0.96
Adam 0.97

[Figure: training accuracy vs. epoch (1 to 20) for the original optimizer and Adam]
Recipe of Deep Learning

For good results on testing data:
* Early Stopping
* Regularization
* Dropout
* Network Structure
Why Overfitting?

* Training data and testing data can be different.

[Figure: samples of the training data and of the testing data]

Learning target is defined by the training data.

The parameters achieving the learning target do not
necessarily have good results on the testing data.
Panacea for Overfitting

* Have more training data
* Create more training data (?)

Handwriting recognition:

[Figure: original training data and created training data, obtained by shifting each image by 15°]
Why Overfitting?

* For experiments, we added some noises to the
testing data

[Figure: pixel values of a testing image after noise was added]

Testing accuracy:
Clean 0.97
Noisy 0.50

Training is not influenced.








Early Stopping

[Figure: total loss vs. epochs for the training set and the validation set. Stop at the epoch where the validation loss is lowest]

Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
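The stopping rule can be sketched independently of any toolkit. The function below scans a list of per-epoch validation losses and stops once the loss has failed to improve for a given number of epochs. The name `early_stopping_epoch`, the `patience` parameter, and the loss values are assumptions for illustration, not the Keras mechanism itself.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss seen before stopping."""
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch


# Illustrative validation losses: improvement stops after epoch 2
val = [0.9, 0.6, 0.5, 0.55, 0.6, 0.4]
stop = early_stopping_epoch(val, patience=2)
print(stop)
```

Note the trade-off: training stops before the later, lower value 0.4 is ever seen, so a larger `patience` tolerates temporary rises at the cost of extra training.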








Weight Decay

* Our brain prunes out the useless links between
neurons.

[Figure: synaptic density at birth, at 6 years old, and at 14 years old. Source: Rethinking the Brain, Families and Work Institute, Rima Shore, 1997; Founders Network slide]

Doing the same thing to the machine's brain improves
the performance.

Weight Decay

[Figure: connections between Layer 1 and Layer 2, with a useless connection marked]

Weight decay is one
kind of regularization
Weight Decay

* Implementation

Original: $w \leftarrow w - \eta \, \dfrac{\partial L}{\partial w}$

Weight Decay: $w \leftarrow (1 - \lambda) \, w - \eta \, \dfrac{\partial L}{\partial w}$, with $\lambda = 0.01$

Smaller and smaller

Keras: http://keras.io/regularizers/
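A minimal sketch of the two updates side by side, assuming the simple form in which the weight is multiplied by $(1 - \lambda)$ before the gradient step. The function names and the zero-gradient scenario are assumptions chosen to isolate the effect of the decay.

```python
def sgd_step(w, grad, eta=0.1):
    """Original update: w <- w - eta * dL/dw."""
    return w - eta * grad


def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    """Weight decay update: w <- (1 - lam) * w - eta * dL/dw."""
    return (1.0 - lam) * w - eta * grad


# A weight that the loss does not care about (derivative always 0)
w_plain, w_decay = 1.0, 1.0
for _ in range(100):
    w_plain = sgd_step(w_plain, grad=0.0)
    w_decay = weight_decay_step(w_decay, grad=0.0)
print(w_plain, w_decay)
```

Without decay the useless weight stays at 1. With decay it shrinks by 1% per update, reaching roughly 0.37 after 100 updates.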








Dropout

Training:

> Each time before updating the parameters
  * Each neuron has p% to dropout
    → The structure of the network is changed (thinner!).
  * Using the new network for training





Dropout

Testing:

> No dropout
  * If the dropout rate at training is p%,
  all the weights times (1-p)%
  * Assume that the dropout rate is 50%.
  If a weight w = 1 by training, set w = 0.5 for testing.
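The training and testing rules can be sketched and checked numerically. During training each activation is zeroed with probability p; at testing nothing is dropped and the weights are multiplied by (1 - p). The function names, the four-input neuron, and the number of trials are assumptions for illustration.

```python
import random


def dropout_train(activations, p, rng):
    """Training: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def scale_weights_for_test(weights, p):
    """Testing: no dropout, all weights multiplied by (1 - p)."""
    return [w * (1.0 - p) for w in weights]


def weighted_sum(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))


rng = random.Random(0)
p = 0.5
weights = [1.0, 1.0, 1.0, 1.0]
inputs = [1.0, 1.0, 1.0, 1.0]

# Average pre-activation over many random dropout masks (training behaviour)
trials = 10_000
avg_train_z = sum(
    weighted_sum(weights, dropout_train(inputs, p, rng)) for _ in range(trials)
) / trials

# Pre-activation at testing with the scaled weights
test_z = weighted_sum(scale_weights_for_test(weights, p), inputs)
print(avg_train_z, test_z)
```

The scaled testing value matches the average seen during training, whereas using the unscaled weights at testing would give a value about twice as large.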





Dropout - Intuitive Reason

> When teaming up, if everyone expects the partner to do
the work, nothing will be done in the end.

> However, if you know your partner will drop out, you
will do better.

> When testing, no one actually drops out, so good
results are obtained eventually.





Dropout - Intuitive Reason

* Why should the weights be multiplied by (1-p)% (p% is the dropout
rate) when testing?

Training of Dropout: assume dropout rate is 50%
Testing of Dropout: no dropout

With the weights from training, about twice as many inputs are active at testing, so $z' \approx 2z$.
With the weights multiplied by (1-p)%, $z' \approx z$.





Dropout is a kind of ensemble.

Ensemble

Training: train a bunch of networks with different structures.

Testing: feed the testing data x to every network and combine their outputs.

Dropout is a kind of ensemble.

Training of Dropout: with M neurons there are $2^M$ possible networks.

> Using one mini-batch to train one network
> Some parameters in the network are shared

Testing of Dropout: use the full network with all the weights multiplied by (1-p)%.
More about dropout

* More reference for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi,
NIPS’13] [Geoffrey E. Hinton, arXiv’12]
* Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
* Dropconnect [Li Wan, ICML’13]
  * Dropout deletes neurons
  * Dropconnect deletes the connection between neurons
* Annealed dropout [S.J. Rennie, SLT’14]
  * Dropout rate decreases by epochs
* Standout [J. Ba, NIPS’13]
  * Each neuron has a different dropout rate
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Let's try it 


ZZ 
πμ! ——1 mode 
Good Results on Testing Data?


Good Results on Training Data?


Step 2: Learning Target


Step 3: Learn!

Dropout


[Figure: training accuracy vs. epoch (1 to 15)]


Testing accuracy:
Noisy 0.50
+ dropout 0.63





Recipe of Deep Learning


Good Results on 
Testing Data? 


Good Results on 
Training Data? 


Network Structure 


CNN is a very good example! 
(next lecture) 





Concluding Remarks
of Lecture II





Recipe of Deep Learning 


Step 1: define a
set of functions


Step 2: goodness 
of function 


Step 3: pick the 
best function 











Good Results on 
Testing Data? 


Good Results on 
Training Data? 


Let's try another task





Document Classification 


“stock” in document


“president” in document


http://top-breaking-news.com/
Data


[Figure: accuracy vs. epoch (up to 20 epochs); legend: MSE, CE]





ReLU


[Figure: accuracy vs. epoch (1 to 20)]
Adaptive Learning Rate


[Figure: accuracy vs. epoch (1 to 20); legend: MSE, CE, ReLU, Adam]


Accuracy: 0.36, 0.55, 0.75, 0.77


Dropout


Accuracy:
Adam 0.77
+ dropout 0.79


[Figure: accuracy vs. epoch (1 to 20), without dropout vs. with dropout]
Lecture III:
Variants of Neural 
Networks 





Variants of Neural Networks 


Convolutional Neural 
Network (CNN) Widely used in 


image processing 


Recurrent Neural Network 
(RNN) 








Why CNN for Image? 


* When processing an image, the first layer of a fully connected network would be very large.


Input: 100 x 100 x 3 values; first hidden layer: 1000 neurons


Can the fully connected network be simplified by considering the properties of image recognition?
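To see how large that first layer is, the weights can simply be counted. This is a small sketch using the sizes on the slide (a 100 x 100 colour image and 1000 neurons); the function name is made up for illustration and biases are counted separately.

```python
def first_layer_size(height, width, channels, hidden_units):
    """Return (weights, biases) of a fully connected first layer."""
    input_dim = height * width * channels   # every pixel value is one input
    weights = input_dim * hidden_units      # one weight per (input, neuron) pair
    biases = hidden_units                   # one bias per neuron
    return weights, biases

weights, biases = first_layer_size(100, 100, 3, 1000)
print(weights, biases)
```

Thirty million weights in a single layer is what motivates the three properties discussed next.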


Why CNN for Image


* Some patterns are much smaller than the whole image


A neuron does not have to see the whole image to discover the pattern.


Connecting to a small region requires fewer parameters














“beak” detector 
Why CNN for Image


* The same patterns appear in different regions. 


“upper-left 
beak” detector 






Do almost the same thing 


They can use the same 
set of parameters. 





“middle beak” 
detector 
Why CNN for Image


* Subsampling the pixels will not change the object


bird -> subsampling -> bird


We can subsample the pixels to make the image smaller


Fewer parameters for the network to process the image








Three Steps for Deep Learning 


Step 1: Convolutional Neural Network
Step 2: goodness of function


Deep Learning is so simple ......
The whole CNN


cat dog ......


Fully Connected network


Convolution


Max Pooling


Max Pooling


Can repeat
many times

The whole CNN
Property 1


> Some patterns are much smaller than the whole image
> The same patterns appear in different regions.


Max Pooling


Can repeat many times


Max Pooling


The whole CNN 


cat dog ...... 












Fully Connected 
network 









Convolution 


Max Pooling 





Can repeat 
many times 
Convolution 


Max Pooling 





CNN — Convolution 


Those are the network 
parameters to be learned. 





6 x 6 image 


Property 1: each filter detects a small pattern (3 x 3).





CNN — Convolution 





stride=1


[Figure: a 3 x 3 filter sliding over the 6 x 6 binary image]





6 x 6 image 





CNN — Convolution 





If stride=2 


We set stride=1 below 





6 x 6 image 
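The sliding-filter operation and the effect of the stride can be sketched in a few lines. The 6 x 6 binary image and the 3 x 3 filter below are illustrative values chosen for this sketch (the slide's own matrices did not survive extraction), and `convolve2d` is a hypothetical helper; as is usual for CNNs, it computes the inner product without flipping the filter.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the inner products."""
    k = len(kernel)
    n_rows = (len(image) - k) // stride + 1
    n_cols = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            top, left = r * stride, c * stride
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

# Illustrative 6 x 6 binary image (assumed values).
IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]

# Illustrative 3 x 3 filter that responds to a diagonal pattern (assumed values).
FILTER = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

stride1 = convolve2d(IMAGE, FILTER, stride=1)  # 4 x 4 result
stride2 = convolve2d(IMAGE, FILTER, stride=2)  # 2 x 2 result
for row in stride1:
    print(row)
```

The largest responses appear where the diagonal pattern occurs in the image (upper left and lower left), which is the "same pattern in different regions" property: one set of nine weights detects it everywhere.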
stride


6 x 6 image

CNN — Convolution 





stride=1


Do the same process for every filter


Feature Map


6 x 6 image


You will get another 6 x 6 image in this way








CNN — Colorful image 





Colorful image 





The whole CNN 


cat dog ...... 


















Fully Connected
network


Convolution


Max Pooling 












Can repeat 
many times 











Max Pooling 





Flatten 


CNN — Max Pooling





Filter 2 


CNN — Max Pooling 


New image 
but smaller 













[Figure: 6 x 6 binary image]


6 x 6 image 






2 x 2 image 


Each filter is a channel
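Max pooling keeps only the largest value in each block, which is how the image becomes "new but smaller". A minimal sketch follows; the 4 x 4 feature map is an assumed example of what one filter might produce on a 6 x 6 image, and `max_pool2d` is a hypothetical helper.

```python
def max_pool2d(feature_map, size=2):
    """Split the map into non-overlapping size x size blocks and keep each block's maximum."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [max(feature_map[r * size + i][c * size + j]
             for i in range(size) for j in range(size))
         for c in range(cols)]
        for r in range(rows)
    ]

# Assumed 4 x 4 output of one filter applied to a 6 x 6 image.
feature_map = [
    [3, -1, -3, -1],
    [-3, 1, 0, -3],
    [-3, -3, 0, 1],
    [3, -2, -2, -1],
]

pooled = max_pool2d(feature_map)  # 2 x 2 image for this filter
print(pooled)
```

Each filter yields one such 2 x 2 map, so two filters give a 2 x 2 image with two channels.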





Max Pooling


Can repeat many times


Smaller than the original image


The number of channels is the number of filters
The whole CNN


cat dog ......


Convolution


Max Pooling


A new image


Fully Connected


Convolution


Max Pooling


A new image


Flatten
The whole CNN





Convolution 

Max Pooling 
Can repeat 
many times 


Convolution 


Max Pooling 
image -> convolution -> pooling


(Ignoring the non-linear activation function after the convolution.)






Filter 1


6 x 6 image


Fewer parameters!


6 x 6 image


Fewer parameters!


Even fewer parameters!









(Ignoring the non-linear activation function after the convolution.)





image -> convolution -> Max pooling


Input: Dim = 6 x 6 = 36


After convolution: Dim = 4 x 4 x 2 = 32


Fully connected: parameters = 36 x 32 = 1152


Convolution: only 9 x 2 = 18 parameters
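The two parameter counts on this slide follow directly from the layer shapes. The sketch below reproduces the arithmetic; the function names are made up for illustration and biases are ignored, as on the slide.

```python
def fully_connected_params(input_dim, output_dim):
    """One weight for every (input, output) pair."""
    return input_dim * output_dim

def convolution_params(kernel_size, num_filters):
    """Each filter has kernel_size x kernel_size weights, shared across all positions."""
    return kernel_size * kernel_size * num_filters

input_dim = 6 * 6        # 6 x 6 image
output_dim = 4 * 4 * 2   # two 4 x 4 feature maps

fc = fully_connected_params(input_dim, output_dim)
conv = convolution_params(kernel_size=3, num_filters=2)
print(fc, conv)
```

Both layers produce the same 32 outputs, but weight sharing reduces the number of learned values from 1152 to 18.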





Convolutional Neural Network 





Step 1: Convolutional Neural Network (convolution, max pooling, fully connected)
Step 2: goodness of function
Step 3: pick the best function


The CNN outputs for "monkey", "cat" and "dog" are compared with the target.


Learning: Nothing special; just gradient descent ...... 


Playing Go 





Network input: 19 x 19 vector (image)
Black: 1
White: -1
None: 0


Network output: next move (19 x 19 positions)


A fully-connected feedforward network can be used, but CNN performs much better.
Training: record of pr 


Playing Go 





Target: 


“RT” = 1
else = 0
Why CNN for playing Go?


* Some patterns are much smaller than the whole image

Alpha Go uses 5 x 5 for first layer


* The same patterns appear in different regions.
Why CNN for playing Go?


* Subsampling the pixels will not change the object


Max Pooling: how to explain this???


Neural network architecture. The input to the policy network is a 19 x 19 x 48 image stack consisting of 48 feature planes. The first hidden layer zero pads the input into a 23 x 23 image, then convolves k filters of kernel size 5 x 5 with stride 1 with the input image and applies a rectifier nonlinearity. Each of the subsequent hidden layers 2 to 12 zero pads the respective previous hidden layer into a 21 x 21 image, then convolves k filters of kernel size 3 x 3 with stride 1, again followed by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1 x 1 with stride 1 [...] and applies a softmax function. [...] Extended Data Table 3 additionally show the results of training with k = 128, 256 and 384 filters.


Alpha Go does not use Max Pooling.

















Variants of Neural Networks 


Recurrent Neural Network (RNN)


Neural Network with Memory








Example Application 


* Slot Filling


I would like to arrive Taipei on November 2nd.


-> ticket booking system ->


Slot
Destination: Taipei
Time of arrival: November 2nd
Example Application


Solving slot filling by
Feedforward network?
Input: a word


(Each word is represented
as a vector)


Taipei ->








1-of-N encoding 


How to represent each word as a vector? 


1-of-N Encoding


lexicon = {apple, bag, cat, dog, elephant}


The vector is lexicon size.
Each dimension corresponds to a word in the lexicon.
The dimension for the word is 1, and others are 0.


apple    = [1 0 0 0 0]
bag      = [0 1 0 0 0]
cat      = [0 0 1 0 0]
dog      = [0 0 0 1 0]
elephant = [0 0 0 0 1]
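The encoding rule is short enough to write out directly. This is a minimal sketch using the five-word lexicon from the slide; the function name `one_of_n` is made up for illustration.

```python
lexicon = ["apple", "bag", "cat", "dog", "elephant"]

def one_of_n(word, lexicon):
    """Return a vector of lexicon size with a 1 at the word's position and 0 elsewhere."""
    vector = [0] * len(lexicon)
    vector[lexicon.index(word)] = 1
    return vector

for word in lexicon:
    print(word, one_of_n(word, lexicon))
```

A word outside the lexicon raises a `ValueError` here, which is the limitation the next slide addresses.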


Beyond 1-of-N encoding 





Dimension for “Other”

apple 0
bag 0
cat 0
dog 0
elephant 0
...
“other” 1

w = “Gandalf”, w = “Sauron”


Word hashing

a-a-a 0
a-a-b 0
...
a-p-p 1
...
p-p-l 1
...

26 x 26 x 26 dimensions

w = “apple”
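Word hashing, as outlined on this slide, represents a word by the three-letter sequences it contains, so any word made of the 26 letters gets a vector even if it is not in the lexicon. The sketch below assumes plain lowercase letter trigrams without boundary markers; the helper names are hypothetical.

```python
from itertools import product
from string import ascii_lowercase

# All 26 x 26 x 26 letter trigrams, in a fixed order.
TRIGRAMS = ["".join(t) for t in product(ascii_lowercase, repeat=3)]
INDEX = {trigram: i for i, trigram in enumerate(TRIGRAMS)}

def letter_trigrams(word):
    """Return the consecutive three-letter sequences of the word."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]

def word_hash(word):
    """Vector with a 1 for every trigram that occurs in the word."""
    vector = [0] * len(TRIGRAMS)
    for trigram in letter_trigrams(word):
        if trigram in INDEX:      # skip trigrams containing non-letters
            vector[INDEX[trigram]] = 1
    return vector

print(letter_trigrams("apple"))
print(sum(word_hash("Gandalf")))
```

Unlike the single “other” dimension, this keeps unseen words such as “Gandalf” and “Sauron” distinguishable from each other.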
Example Application


Solving slot filling by
Feedforward network?
Input: a word
(Each word is represented
as a vector)


Output:
Probability distribution that the input word belongs to the slots (e.g. dest, time of departure)


Taipei ->
Example Application


arrive  Taipei  on     November  2nd
other   dest    other  time      time


leave   Taipei  on     November  2nd
        place of departure


Output slots: dest, time of departure


Taipei ->


Neural network needs memory!
Three Steps for Deep Learning


Step 1: Recurrent Neural Network
Step 2: goodness of function


Deep Learning is so simple ......
Recurrent Neural Network (RNN)


The outputs of the hidden layer
are stored in the memory.


store


Memory can be considered
as another input.
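The idea that the memory acts as another input can be sketched with a single hidden neuron. This is an illustrative toy, not the tutorial's network: the weights, the tanh activation and the helper names are assumptions.

```python
import math

def rnn_step(x, memory, w_in=1.0, w_mem=1.0, bias=0.0):
    """Hidden output computed from the current input and the stored memory."""
    return math.tanh(w_in * x + w_mem * memory + bias)

def run_rnn(sequence):
    """Apply the same step to every input, storing each hidden output as the new memory."""
    memory = 0.0          # memory starts empty
    outputs = []
    for x in sequence:
        memory = rnn_step(x, memory)   # store the hidden output
        outputs.append(memory)
    return outputs

# The last input is the same in both sequences, but the history differs.
after_positive = run_rnn([1.0, 0.5])[-1]
after_negative = run_rnn([-1.0, 0.5])[-1]
print(after_positive, after_negative)
```

The same final input produces different outputs because the stored values differ, which is what lets the network treat "Taipei" differently after "arrive" and after "leave".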
RNN


The same network is used again and again.


Probability of “arrive” in each slot
Probability of “Taipei” in each slot
Probability of “on” in each slot


arrive Taipei on November 2nd
RNN


The probabilities of “Taipei” in each slot are different in the two cases:


leave Taipei: Prob of “leave” in each slot, then Prob of “Taipei” in each slot
arrive Taipei: Prob of “arrive” in each slot, then Prob of “Taipei” in each slot


The values stored in the memory are different.














Of course it can be deep ... 








Bidirectional RNN 














Long Short-term Memory (LSTM) 


Special Neuron: 4 inputs, 1 output


* Output Gate: a signal from other parts of the network controls the output gate
* Forget Gate: a signal from other parts of the network controls the forget gate
* Input Gate: a signal from other parts of the network controls the input gate
* Input: from other parts of the network
* Output: to other parts of the network


Output Gate: z_o

a = h(c') f(z_o)   (multiply)


Activation function f is usually a sigmoid function:
between 0 and 1, mimicking an open or closed gate.


Forget Gate: z_f

Input Gate: z_i

c' = g(z) f(z_i) + c f(z_f)
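The two memory-cell equations can be turned into a few lines of code to see how the gates behave. This is a scalar sketch under assumptions: the slide only says that f is usually a sigmoid, so using tanh for g and h is a choice made here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(z, z_i, z_f, z_o, c, g=math.tanh, h=math.tanh, f=sigmoid):
    """One update of a single LSTM cell.

    z   : input to the cell
    z_i : signal controlling the input gate
    z_f : signal controlling the forget gate
    z_o : signal controlling the output gate
    c   : value currently stored in the memory
    """
    c_new = g(z) * f(z_i) + c * f(z_f)   # c' = g(z) f(z_i) + c f(z_f)
    a = h(c_new) * f(z_o)                # a  = h(c') f(z_o)
    return a, c_new

# Input gate closed, forget gate open: the memory keeps its old value.
_, kept = lstm_cell(z=1.0, z_i=-100.0, z_f=100.0, z_o=100.0, c=3.0)

# Input gate open, forget gate closed: the memory is overwritten by g(z).
_, overwritten = lstm_cell(z=1.0, z_i=100.0, z_f=-100.0, z_o=100.0, c=3.0)

# Output gate closed: nothing is emitted, whatever the memory holds.
silent, _ = lstm_cell(z=1.0, z_i=100.0, z_f=100.0, z_o=-100.0, c=3.0)
print(kept, overwritten, silent)
```

Note the naming on the slide: an "open" forget gate (f(z_f) close to 1) means the memory is remembered, and a "closed" one means it is forgotten.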
Multiple-layer
LSTM


This is quite
standard now.


https://img.komicolle.org/2015-09-20/src/14426967627131.gif





Three Steps for Deep Learning 


Step 1: define a set of functions
Step 2: goodness of function


Deep Learning is so simple ......


Learning Target 






Training
Sentences:


arrive Taipei on November 2nd

other dest other time time


Three Steps for Deep Learning 


Step 1: define a set of functions
Step 2: goodness of function





Deep Learning is so simple ...... 
Learning


Backpropagation
through time (BPTT)





RNN Learning is very difficult in practice. 








Unfortunately ...... 


* RNN-based network is not always easy to learn


Real experiments on Language modeling


[Figure: Total Loss vs. Epoch]


The error surface is rough.


The error surface is either
very flat or very steep.


[Figure: error surface, from [Razvan Pascanu, ICML'13]]
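One way to see why the error surface is either very flat or very steep is to push a value through the same recurrent weight many times. The sketch below is an assumed toy setup (a one-neuron linear RNN with recurrent weight w, fed 1 at the first step and 0 afterwards), not a reconstruction of the slide's figure.

```python
def linear_rnn_output(w, steps=1000):
    """Output at the last step of a one-neuron linear RNN with recurrent weight w."""
    memory = 0.0
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0   # input is 1 at the first step, 0 afterwards
        memory = x + w * memory      # the stored value is multiplied by w each step
    return memory                    # equals w ** (steps - 1)

for w in (0.99, 1.0, 1.01):
    print(w, linear_rnn_output(w))
```

A change of 0.01 in w moves the output from nearly 0 to more than 10000, so the gradient with respect to w is tiny on one side and huge on the other. That makes a single learning rate hard to choose.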
Small learning rate?


Large learning rate?


Toy Example









Helpful Techniques


* Long Short-term Memory (LSTM)
* Can deal with gradient vanishing (not gradient explode)
> Memory and input are added
> The influence never disappears unless the forget gate is closed


No gradient vanishing (if the forget gate is opened).


Gated Recurrent Unit (GRU): simpler than LSTM [Cho, EMNLP'14]


Helpful Techniques 


Clockwork RNN [Jan Koutnik, JMLR'14]


Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]


Vanilla RNN initialized with identity matrix + ReLU activation function [Quoc V. Le, arXiv'15]


> Outperforms or is comparable with LSTM in 4 different tasks
More Applications ......


Probability of “arrive” in each slot
Probability of “Taipei” in each slot
Probability of “on” in each slot


arrive Taipei on November 2nd


Input and output are both sequences with the same length


RNN can do more than that!
Many to one


* Input is a vector sequence, but output is only one vector


Sentiment Analysis


Keras Example:
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py





Many to Many (Output is shorter) 


* Both input and output are sequences, but the output is shorter.
* E.g. Speech Recognition


Input: (vector sequence)
Output: (character sequence)


Problem?





Many to Many (Output is shorter) 


* Both input and output are sequences, but the output is shorter.
* Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06][Alex Graves, ICML'14][Hasim Sak, Interspeech'15][Jie Li, Interspeech'15][Andrew Senior, ASRU'15]







Many to Many (No Limitation) 


* Both input and output are sequences with different lengths. -> Sequence to sequence learning
* E.g. Machine Translation (“machine learning” -> its Chinese translation)
Many to Many (No Limitation)


* Both input and output are sequences with different lengths. -> Sequence to sequence learning
* E.g. Machine Translation (“machine learning” -> its Chinese translation)
Many to Many (No Limitation)


[Figure: a thread of chained comments, all dated 06/12]


Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87





Many to Many (No Limitation) 


* Both input and output are sequences with different lengths. -> Sequence to sequence learning
* E.g. Machine Translation (“machine learning” -> its Chinese translation)


Add a symbol “===” (meaning “stop”)


[Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]


One to Many 


* Input an image, but output a sequence of words [Kelvin Xu, arXiv'15][Li Yao, ICCV'15]


A vector for the whole image


Input image


Caption Generation
Application:
Video Caption Generation


A group of people is knocked by a tree.
A group of people is walking in the forest.


Video Caption Generation


* Can a machine describe what it sees in a video?
* Demo
Concluding Remarks


Convolutional Neural 
Network (CNN) 


Recurrent Neural Network 
(RNN) 








Lecture IV: 
Next Wave 





Outline 






Supervised Learning

* Ultra Deep Network (new network structure)
* Attention Model (new network structure)


Reinforcement Learning


Unsupervised Learning

* Image: Realizing what the World Looks Like
* Text: Understanding the Meaning of Words
* Audio: Learning human language without supervision
Skyscraper


[Figure: height comparison (0 to 800 metres) of tall structures: Great Pyramid (Giza), Eiffel Tower (Paris), Empire State Building (New York), Petronas Towers (Kuala Lumpur), Taipei 101 (Taipei), World Trade Center (New York), CN Tower (Toronto), KVLY-TV Mast (Blanchard), Warsaw Radio Mast (Gabin), Burj Khalifa (Dubai)]


https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/media/File:BurjDubaiHeight.svg
Ultra Deep Network


http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf


AlexNet (2012)


VGG (2014): 19 layers, 7.3% error


a 


Ultra Deep Network 





Error rates:
AlexNet (2012): 16.4%
VGG (2014): 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015): 152 layers


For comparison, Taipei 101: 101 layers


















Ultra Deep Network 


Worry about overfitting?


Worry about training first!


This ultra deep network has a special structure.


Error rates:
AlexNet (2012): 16.4%
VGG (2014): 7.3%
GoogleNet (2014)
Residual Net (2015): 3.57%





Ultra Deep Network 


* An ultra deep network is an ensemble of many networks with different depths.


Ensemble: 6 layers, 4 layers, 2 layers


Residual Networks are Exponential Ensembles of Relatively Shallow Networks
https://arxiv.org/abs/1605.06431





Ultra Deep Network


* FractalNet
FractalNet: Ultra-Deep Neural Networks without Residuals
https://arxiv.org/abs/1605.07648


* Resnet in Resnet
Resnet in Resnet: Generalizing Residual Architectures
https://arxiv.org/abs/1603.08029


* Good Initialization?
All you need is a good init
http://arxiv.org/pdf/1511.06422v7.pdf


Ultra Deep Network 


* Residual Network
Deep Residual Learning for Image Recognition
http://arxiv.org/abs/1512.03385


* Highway Network (with a gate controller)
Training Very Deep Networks
https://arxiv.org/pdf/1507.06228v2.pdf


[Figure: three networks, each running from input layer to output layer and using a different number of layers]


Highway Network automatically determines the layers needed!





Attention-based Model


Question: What is deep learning?


Memories:
* What you learned in these lectures
* Lunch today
* Summer vacation 10 years ago


Organize -> Answer


http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html


Attention-based Model 







Input -> DNN/RNN -> output


Reading Head


Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html





Attention-based Model v2 


Input -> DNN/RNN -> output


Reading Head Controller -> Reading Head
Writing Head Controller -> Writing Head


Neural Turing Machine





Reading Comprehension


Reading Head
Controller


Each sentence becomes a vector.








Reading Comprehension 


* End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. 
Weston, R. Fergus. NIPS, 2015. 


The position of reading head:
Brian is a frog.
Lily is gray.
Brian is yellow.
[...] is green.


Keras has example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
Visual Question Answering


AI System





What is the mustache 
made of? 


source: http://visualqa.org/ 
Visual Question Answering


Query -> DNN/RNN -> answer


Reading Head
Controller


A vector for each region




Visual Question Answering 


* Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring 
Question-Guided Spatial Attention for Visual Question 
Answering. arXiv Pre-Print, 2015 


Is there a red square on the bottom of the cat? 
GT: yes Prediction: yes 
Speech Question Answering


* TOEFL Listening Comprehension Test by Machine
* Example:
Audio Story: (The original story is 5 min long.)
Question: "What is a possible origin of Venus' clouds?"
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
Simple Baselines


Experimental setup: 717 for training, 124 for validation, 122 for testing


Naive approaches:
(2) select the shortest choice as answer
(4) select the choice semantically most similar to the others


[Figure: accuracy (%) of the naive approaches, with the random baseline]


Model Architecture


Question: "What is a possible origin of Venus' clouds?" -> Semantic Analysis -> Question Semantics


Audio Story -> Speech Recognition -> Semantic Analysis


Answer: select the choice most similar to the answer


Example of the speech recognition output: "... It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on ..."





Model Architecture


Word-based Attention


Model Architecture


Sentence-based Attention


[Figure: the question attends over the words of the story, grouped into sentences (Sentence 1, Sentence 2, ...), over several hops (hop 1, hop 2, ..., hop n); the result is compared with Choice A, Choice B, Choice C and Choice D]
Supervised Learning


[Figure: accuracy (%) of the naive approaches and of supervised learning]


Memory Network: 39.2% (proposed by FB AI group)
Word-based Attention: 48.8%


[Tseng & Lee, Interspeech 16]
[Fang & Hsu & Lee, SLT 16]

Scenario of Reinforcement Learning


The agent receives an observation from the environment, takes an action, and receives a reward.


Scenario of Reinforcement Learning


Agent learns to take actions to maximize expected reward


http://www.sznews.com/news/content/2013-11/26/content_8800180.htm


Supervised v.s. Reinforcement 


* Supervised: learning from teacher
“Hello” -> Say “Hi”
“Bye bye” -> Say “Good bye”


* Reinforcement: learning from critics







Scenario of Reinforcement Learning


Agent learns to take actions to maximize expected reward.


Agent: AlphaGo. It receives an observation from the environment, takes an action, and receives a reward:

If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0


Supervised v.s. Reinforcement 


* Supervised:

Next move: “3-3”

Next move: “5-5”


* Reinforcement Learning

First move -> ......


Alpha Go is supervised learning + reinforcement learning. 











Difficulties of Reinforcement 
Learning 


* It may be better to sacrifice immediate reward to gain more long-term reward
  * E.g. Playing Go
* The agent's actions affect the subsequent data it receives
  * E.g. Exploration
Deep Reinforcement Learning


Observation: function input
Action: function output


Reward: used to pick the best function


Application: Interactive Retrieval 


* Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16] 





“Deep Learning” related to Machine Learning? 
“Deep Learning" related to Education? 





Deep Reinforcement Learning 


* Different network depth


[Figure: performance vs. training epochs (1 to 20) for networks of different depth, e.g. 2-layer and 4-layer; better retrieval performance with less user labor is preferred]


Some depth is needed.
The task cannot be addressed by a linear model.


More applications 


* Alpha Go, Playing Video Games, Dialogue
* Flying Helicopter
  * https://www.youtube.com/watch?v=0JLO4JJjocc
* Driving
  * https://www.youtube.com/watch?v=Oxo1Ldx3LSQ
* Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
  * http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai





To learn deep reinforcement learning ......


* Lectures of David Silver
  * http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
  * 10 lectures (1:30 each)
* Deep Reinforcement Learning
  * http://videolectures.net/rldm2015_silver_reinforcement_learning/
Does the machine know what the world looks like?


Ref: https://openai.com/blog/generative-models/




Deep Dream 


ο Given a photo, machine adds what it sees ...... 








Deep Dream 


e Given a photo, machine adds what it sees ...... 








Deep Style 


* Given a photo, make its style like famous paintings 





https://dreamscopeapp.com/


Deep Style 


* Given a photo, make its style like famous paintings 


https://dreamscopeapp.com/


Deep Style 





content 





Generating Images by RNN 


Output: color of 2nd pixel, color of 3rd pixel, color of 4th pixel


Input: color of 1st pixel, color of 2nd pixel, color of 3rd pixel
Generating Images by RNN


* Pixel Recurrent Neural Networks
* https://arxiv.org/abs/1601.06759


[Figure: generated images compared with real ones]





Generating Images 


* Training a decoder to generate images is 
unsupervised 


Auto-encoder


Not a state-of-the-art approach

Generating Images 


* Training a decoder to generate images is unsupervised
* Variational Auto-encoder (VAE)
  * Ref: Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114
* Generative Adversarial Network (GAN)
  * Ref: Generative Adversarial Networks, https://arxiv.org/abs/1406.2661


code -> decoder -> image
Which one is machine-generated?


Ref: https://openai.com/blog/generative-models/


https://github.com/mattya/chainer-DCGAN
Machine Reading


* The machine learns the meaning of words from reading a lot of documents without supervision


http://top-breaking-news.com/


Machine Reading 


* The machine learns the meaning of words from reading a lot of documents without supervision


Word Vector / Embedding


[Figure: words placed as points in a vector space, e.g. tree, flower]
Machine Reading


* Generating Word Vector/Embedding is unsupervised


Apple


Training data is a lot of text


https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490





Machine Reading 


* The machine learns the meaning of words from reading a lot of documents without supervision


* A word can be understood by its context


[Example, originally in Chinese: two names that appear in the same context (“... 520 ...”) are something very similar]
Word Vector


[Figure: 2-D projection of word vectors for countries and their capitals, e.g. Turkey and Ankara, Russia and Moscow, Canada and Ottawa, Japan and Tokyo, China and Beijing; Germany and Vietnam also appear]


Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Vector


* Characteristics
V(hotter) − V(hot) ≈ V(bigger) − V(big)
V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany)
V(king) − V(queen) ≈ V(uncle) − V(aunt)


* Solving analogies


Rome : Italy = Berlin : ?


V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy)


Compute V(Berlin) − V(Rome) + V(Italy)


Find the word w with the closest V(w)
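The analogy procedure (compute the offset vector, then find the closest word) can be sketched as follows. The 2-D vectors are made up purely for illustration; real word vectors are learned from text and have far more dimensions. The function name is hypothetical.

```python
# Hypothetical 2-D word vectors, invented for this sketch only.
V = {
    "Italy": (1.0, 1.0), "Rome": (1.0, 3.0),
    "Germany": (3.0, 1.2), "Berlin": (3.0, 3.1),
    "Japan": (5.0, 0.9), "Tokyo": (5.0, 3.0),
}

def solve_analogy(a, b, c, vectors):
    """Solve a : b = c : ? by finding the word closest to V(c) - V(a) + V(b)."""
    target = tuple(vc - va + vb
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))

    def squared_distance(word):
        return sum((x - t) ** 2 for x, t in zip(vectors[word], target))

    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)

print(solve_analogy("Rome", "Italy", "Berlin", V))   # Rome : Italy = Berlin : ?
print(solve_analogy("Rome", "Italy", "Tokyo", V))    # Rome : Italy = Tokyo : ?
```

Excluding the query words is a common practical choice, since the nearest vector to the target is otherwise often one of the inputs.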
Machine Reading


* The machine learns the meaning of words from reading a lot of documents without supervision


Demo


* Model used in demo is provided by [...]
* Part of the project done by [...]
* TA: [...]
* Training data is from PTT (collected by [...])
Learning from Audio Book


The machine does not have any prior knowledge


The machine listens to lots of audio books


Like an infant


[Chung, Interspeech 16]





Audio Word to Vector 


* Audio segment corresponding to an unknown word 
→ Fixed-length vector 











Audio Word to Vector 


* The audio segments corresponding to words with 
similar pronunciations are mapped to vectors that are close to each other. 






[Figure: audio segments of "dog" and "dogs" map to nearby vectors, as do segments of "never" and "ever"] 
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





Sequence-to-sequence 
Auto-encoder 


[Figure: an audio segment is converted to acoustic features, which the RNN Encoder reads to produce a vector] 


The vector we want 
How to train RNN Encoder? 


Sequence-to-sequence 
Auto-encoder 


[Figure: input acoustic features → RNN Encoder → vector → RNN Decoder] 









The RNN encoder and 
decoder are jointly trained. 
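
To make the encoder/decoder pairing concrete, here is a minimal forward-pass sketch in plain Python. It rests on several assumptions not stated on the slide: a simple tanh RNN cell, randomly initialized weights, a mean-squared reconstruction error, and arbitrary sizes `FEAT` and `HID`. All function names are hypothetical. It shows only that segments of any length map to a fixed-length vector and what quantity joint training would minimize; the gradient updates themselves are omitted.

```python
import math
import random

random.seed(0)
FEAT, HID = 3, 4  # acoustic feature size and vector size (arbitrary choices)

def rand_matrix(rows, cols):
    """Small random weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Encoder and decoder weights; joint training would update all four matrices.
W_in, W_hh = rand_matrix(HID, FEAT), rand_matrix(HID, HID)   # encoder
U_hh, W_out = rand_matrix(HID, HID), rand_matrix(FEAT, HID)  # decoder

def encode(frames):
    """Read the frames one by one; the last hidden state is the vector."""
    h = [0.0] * HID
    for x in frames:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_hh, h))]
    return h

def decode(vector, length):
    """Unroll from the vector and emit one reconstructed frame per step."""
    s, outputs = vector, []
    for _ in range(length):
        s = [math.tanh(a) for a in matvec(U_hh, s)]
        outputs.append(matvec(W_out, s))
    return outputs

def reconstruction_loss(frames):
    """Mean squared error between the input frames and the decoder output."""
    recon = decode(encode(frames), len(frames))
    errs = [(x - y) ** 2 for fx, fy in zip(frames, recon) for x, y in zip(fx, fy)]
    return sum(errs) / len(errs)

# Two fake "audio segments" of different lengths (random numbers, not real audio).
short = [[random.random() for _ in range(FEAT)] for _ in range(5)]
long_ = [[random.random() for _ in range(FEAT)] for _ in range(12)]
print(len(encode(short)), len(encode(long_)))  # both vectors have HID entries
print(reconstruction_loss(short))              # the quantity training would minimize
```

Because the decoder sees nothing but the vector, lowering the reconstruction error forces the encoder to pack the content of the whole segment into that vector, and no labels are needed.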





acoustic features 


audio segment 


Audio Word to Vector 
- Results 


* Visualizing embedding vectors of the words 





[Figure: 2-D plot of the embedding vectors] 





WaveNet (DeepMind) 


[Figure: stack of layers, from bottom to top: Input, three Hidden Layers, Output] 


https://deepmind.com/blog/wavenet-generative-model-raw-audio/ 





Concluding Remarks 







Lecture I: Introduction of Deep Learning 


Lecture III: Variants of Neural Network 


Lecture IV: Next Wave 
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* New Job in AI Age 





http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game 














Step 1: define a set of function 
Step 2: goodness of function 
Step 3: pick the best function 





[Slide text illegible in extraction; legible fragments: "step 1", "step 3", "best function", "E.g. Deep Learning"] 
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http://www.gvm.com.tw/webonly_content_10787.html 


[Announcement slide, text illegible in extraction; legible details: 105.09.20, 105.10.04 ~ 10.12, 105.09.28 12:20] 
https://comm.ntu.edu.tw/new/Master.php 
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